We have chosen Oleksandr Slyvka as our representative; his M value is 6, which gives us the year 2018 to analyse.

Libraries

library(eurostat)
library(ggplot2)
library(cowplot)
library(corrplot)
## corrplot 0.95 loaded
library(vtable)
## Loading required package: kableExtra
library(grid)
library(sf)
## Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.4.0; sf_use_s2() is TRUE
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
## 
##     group_rows
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rnaturalearth)
# install.packages("remotes")
# remotes::install_github("ropensci/rnaturalearthhires")
library(rnaturalearthhires)
library(countrycode)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(e1071)

Importing & Explaining Average Salary

data <- get_eurostat("nama_10_fte", time_format = "num")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/f3a41acbb8d712f391b3e494a3ad2c0c.rds
## Table  nama_10_fte  read from cache file:  /tmp/RtmpYIVRWJ/eurostat/f3a41acbb8d712f391b3e494a3ad2c0c.rds
data <- data[(data$unit == "EUR") & (data$TIME_PERIOD == 2018),]
data <- data[c("geo", "values")]
data
## # A tibble: 28 × 2
##    geo   values
##    <chr>  <dbl>
##  1 AT     44646
##  2 BE     46422
##  3 BG      8127
##  4 CY     21472
##  5 CZ     16137
##  6 DE     43340
##  7 DK     59519
##  8 EA20   35647
##  9 EE     17086
## 10 EL     15920
## # ℹ 18 more rows

Our data includes records for the Euro Area (EA20) and the European Union (EU27_2020). These are aggregates rather than standalone countries, so we had better delete them. Eurostat also uses EL for Greece (Ellada), which we will rename. Note also that we have no record for the Netherlands.

Lastly we will replace country codes with country names.

data <- data[(data$geo != "EA20") & (data$geo != "EU27_2020"),]
data[(data$geo == "EL"), "geo"] <- "GR"
data$geo <- countrycode(data$geo, origin = 'iso2c', destination = 'country.name')
data
## # A tibble: 26 × 2
##    geo      values
##    <chr>     <dbl>
##  1 Austria   44646
##  2 Belgium   46422
##  3 Bulgaria   8127
##  4 Cyprus    21472
##  5 Czechia   16137
##  6 Germany   43340
##  7 Denmark   59519
##  8 Estonia   17086
##  9 Greece    15920
## 10 Spain     27426
## # ℹ 16 more rows

Let us define a generic function that will help us display data on a map of the European Union.

world_map <- ne_countries(scale = "large", returnclass = 'sf')
europe_bbox <- st_bbox(c(xmin = -20, xmax = 40, ymax = 75, ymin = 30), crs = st_crs(4326))
eu_map <- world_map %>% filter(name %in% data$geo) %>%  st_crop(europe_bbox)

plot_eu <- function(mydata, col, by_data="geo", by_map="name") {
    disp <- left_join(eu_map, mydata, by = setNames(by_data, by_map))
    return (ggplot(disp) +  geom_sf(aes(fill=.data[[col]])) + theme_minimal())
}

Now we will take a look at a map of Average Salaries, a histogram of the variable, and a summary of the mean, standard deviation, limits, quartiles and IQR.

hist(data$values, main="Histogram of Average Salary (EUR)", col="cyan")

plot_eu(data, "values") +
    scale_fill_viridis_c(
      name="Average Salary (EUR)", 
      guide = guide_colorbar(barheight = 15, barwidth = 1),
      breaks=10000*(1:6)
    )

We can see that employees in countries in the east are generally paid less. The histogram shows that the distribution is bimodal, with most countries concentrated in two intervals (10-30k and 40-50k).
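The two clusters the histogram suggests can be double-checked by counting countries per salary band. This is a sketch: the helper `band_counts` and the band edges (chosen by eye from the histogram) are ours, not part of the original analysis.

```r
# Tabulate salaries into rough bands (EUR); edges chosen by eye.
band_counts <- function(salaries) {
  table(cut(
    salaries,
    breaks = c(0, 10000, 30000, 40000, 50000, 70000),
    labels = c("<10k", "10-30k", "30-40k", "40-50k", "50k+")
  ))
}

# Four example values from the table above; in the report one would
# call band_counts(data$values) on the full 26-country column.
band_counts(c(8127, 21472, 44646, 59519))
```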

data %>% slice_max(order_by = values, n = 7)
## # A tibble: 7 × 2
##   geo        values
##   <chr>       <dbl>
## 1 Luxembourg  65570
## 2 Denmark     59519
## 3 Ireland     48211
## 4 Belgium     46422
## 5 Austria     44646
## 6 Germany     43340
## 7 Sweden      43003
summ <- c('mean(x)', 'sd(x)', 'skewness(x)',  'min(x)', 'pctile(x)[25]',
          'median(x)', 'pctile(x)[75]', 'max(x)', 'IQR(x)'
)
sumtable(
  data, 'values',
  out='kable',
  summ=summ
)
Summary Statistics
Variable Mean Sd Skewness Min Pctile[25] Median Pctile[75] Max IQR
values 28001 16339 0.68 8127 14904 22998 42832 65570 27929

The mean is 28k euros, while the median is only 23k. From the skewness and the maximum value we might expect a few countries where employees are paid far more. Our table of the top 7 countries by Average Salary shows that these are led by Luxembourg and Denmark; the former is famous for its banks, while Denmark is known as a money harbor thanks to its stability and location.

Regressors’ Descriptions

Distance to Voronezh

Our first regressor is the distance between each country’s capital and the city of Voronezh. We expect it to correlate positively with Average Salary. To construct this feature we first obtain the coordinates of each country’s capital.

# Load populated places, including capitals
places <- ne_download(scale = "small", type = "populated_places", category = "cultural", returnclass = "sf")
## Reading layer `ne_110m_populated_places' from data source 
##   `/tmp/RtmpC93Egi/ne_110m_populated_places.shp' using driver `ESRI Shapefile'
## Simple feature collection with 243 features and 137 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -175.2206 ymin: -41.29207 xmax: 179.2166 ymax: 64.14346
## Geodetic CRS:  WGS 84
capitals <- places %>%
  filter(FEATURECLA == "Admin-0 capital", ADM0NAME %in% data$geo)

# View capital names and coordinates
capital_coords <- capitals %>% select(name = NAME, country = ADM0NAME, geometry)

capital_coords <- capital_coords %>%
  mutate(
    lon = st_coordinates(geometry)[, 1],
    lat = st_coordinates(geometry)[, 2]
  )

With the coordinates at hand, we compute the distance using the st_distance function from sf, which accounts for the Earth’s curvature and computes spherical (great-circle) distances.

points <- st_as_sf(capital_coords, coords = c("lon", "lat"), crs = 4326)

voronezh <- st_sfc(st_point(c(51.67204, 39.1843)), crs = 4326)

points$distance_to_voronezh <- as.numeric(st_distance(points, voronezh)) / 1000
points <- as.data.frame(points)[c("country", "distance_to_voronezh")]
points
##       country distance_to_voronezh
## 1  Luxembourg             3735.200
## 2    Slovenia             3104.403
## 3    Slovakia             2923.548
## 4   Lithuania             2613.100
## 5      Latvia             2809.951
## 6     Croatia             2989.032
## 7     Estonia             2939.923
## 8    Bulgaria             2402.073
## 9       Malta             3273.399
## 10     Cyprus             1678.925
## 11    Hungary             2768.991
## 12    Romania             2188.488
## 13   Portugal             5155.274
## 14     Poland             2757.398
## 15    Ireland             4579.691
## 16    Czechia             3146.524
## 17    Finland             2981.683
## 18    Denmark             3400.385
## 19    Belgium             3864.656
## 20      Spain             4652.032
## 21     Sweden             3248.988
## 22    Germany             3263.342
## 23     Greece             2422.649
## 24    Austria             2979.623
## 25      Italy             3296.564
## 26     France             4011.520
data <- left_join(data, points, by=setNames("country","geo"))
data
## # A tibble: 26 × 3
##    geo      values distance_to_voronezh
##    <chr>     <dbl>                <dbl>
##  1 Austria   44646                2980.
##  2 Belgium   46422                3865.
##  3 Bulgaria   8127                2402.
##  4 Cyprus    21472                1679.
##  5 Czechia   16137                3147.
##  6 Germany   43340                3263.
##  7 Denmark   59519                3400.
##  8 Estonia   17086                2940.
##  9 Greece    15920                2423.
## 10 Spain     27426                4652.
## # ℹ 16 more rows
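The spherical-distance idea behind st_distance can be illustrated with a manual haversine implementation. This is a sketch on a sphere of radius 6371 km; the helper name `haversine_km` and the example cities are ours, and sf's s2 backend remains the authoritative computation in the report.

```r
# Haversine great-circle distance in kilometers between two
# (lon, lat) points on a sphere of radius R.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- function(deg) deg * pi / 180
  dlat <- to_rad(lat2 - lat1)
  dlon <- to_rad(lon2 - lon1)
  a <- sin(dlat / 2)^2 +
    cos(to_rad(lat1)) * cos(to_rad(lat2)) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))  # pmin guards against rounding above 1
}

# Example: Berlin (13.40 E, 52.52 N) to Paris (2.35 E, 48.86 N)
haversine_km(13.40, 52.52, 2.35, 48.86)  # about 877 km
```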

Let us do a quick analysis of this variable, including a map, a histogram, basic descriptive statistics and a scatter plot against Average Salary.

plot_eu(data, "distance_to_voronezh") +
    scale_fill_gradient(
        name = "Distance to Voronezh (km)",
        low = "black", high = "white" 
    ) +
    guides(fill = guide_colorbar(barwidth = 1, barheight = 15))

hist(
    data$distance_to_voronezh, 
    main="Histogram of distance to Voronezh (km)",
    xlab="distance to Voronezh (km)",
    col="gray",
    breaks=1000 * (1:6)
)

plot(
    data$distance_to_voronezh, data$values,
    main=sprintf("Distance to Voronezh (km) X Average Salary (r=%.2f)", cor(data$distance_to_voronezh, data$values)),
    xlab="Distance to Voronezh (km)",
    ylab="Average Salary",
    col="black", pch=19
)

sumtable(
  data, 'distance_to_voronezh',
  out='kable',
  summ=summ,
  factor.numeric = TRUE
)
Summary Statistics
Variable Mean Sd Skewness Min Pctile[25] Median Pctile[75] Max IQR
distance_to_voronezh 3200 776 0.67 1679 2779 3047 3374 5155 595

This variable is purely geographical, so we obtain a white-to-black gradient spanning all of Europe. The histogram is right-skewed, a result of the greater number of countries located in the east of the EU. As expected, Distance to Voronezh correlates positively with Average Salary; the Pearson coefficient is not large (less than 0.5, see the scatter plot), but it shows a connection: the further a country is from Voronezh, the higher the salary that can be expected. Lastly, the descriptive statistics agree with the observation about the histogram, showing a skewness of \(0.67\), a mean of \(3200\) and a standard deviation of \(776\) kilometers.

Illia Lyhin’s preference to visit a given country for a leisure week (as of 01.05.2018 at 16:00)

Illia Lyhin, one of the authors of this project, is an inveterate traveller, so he surely has an opinion on whether he wants to visit a country or not. We assume this is a good indicator of a country’s Average Salary, as Mr. Lyhin rarely happens to be wrong about topics touching monetary relations. Below we present his view.

lyhin_lvls = c("Low", "Medium", "High")

data$lyhin_preference = "Medium"
data[data$geo %in% c("Austria", "Cyprus", "Czechia", "Denmark", "Ireland", "Portugal"),]$lyhin_preference = "High"
data[data$geo %in% c("Belgium", "Bulgaria", "Hungary", "Romania", "Slovakia", "Lithuania", "Latvia"),]$lyhin_preference = "Low"
data$lyhin_preference <- factor(data$lyhin_preference, levels=lyhin_lvls, ordered=TRUE)
data
## # A tibble: 26 × 4
##    geo      values distance_to_voronezh lyhin_preference
##    <chr>     <dbl>                <dbl> <ord>           
##  1 Austria   44646                2980. High            
##  2 Belgium   46422                3865. Low             
##  3 Bulgaria   8127                2402. Low             
##  4 Cyprus    21472                1679. High            
##  5 Czechia   16137                3147. High            
##  6 Germany   43340                3263. Medium          
##  7 Denmark   59519                3400. High            
##  8 Estonia   17086                2940. Medium          
##  9 Greece    15920                2423. Medium          
## 10 Spain     27426                4652. Medium          
## # ℹ 16 more rows

With the data at hand, let us plot it.

color.low = "#EFEA90"
color.mid = "#DFD420"
color.hig = "#B6AC1A"

plot_eu(data, "lyhin_preference") +
  scale_fill_manual(
    name = "Illia Lyhin's Preference",
    values = c(
      "Low" = color.low,
      "Medium" = color.mid,
      "High" = color.hig
    )
  )

par(mar = c(5, 4, 4, 8), xpd = TRUE)
boxplot(
    values ~ lyhin_preference, data=data,
    main="Average Salary by Illia Lyhin's Preference",
    ylab="Average Salary",
    xlab="Illia Lyhin's Preference",
    col=c(color.low, color.mid, color.hig)
)

data %>%
  group_by(lyhin_preference) %>%
  summarise(
    count = n(),
    mean = mean(values),
    sd = sd(values),
    min = min(values),
    median=median(values),
    max = max(values)
  )
## # A tibble: 3 × 7
##   lyhin_preference count   mean     sd   min median   max
##   <ord>            <int>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 Low                  7 17119. 13110.  8127  13486 46422
## 2 Medium              13 30815. 15112. 13057  27426 65570
## 3 High                 6 34598. 18490. 16137  33059 59519

There is nothing to note about individual records, except excellent choices for a vacation. The box plots show that low-preference countries do indeed have lower Average Salaries, though Belgium is an outlier. Medium- and high-preference countries behave very similarly; High has the larger spread, with a standard deviation of \(18.5k\) euros.

Number of people in penitentiary system

One could hypothesize that Average Salary is a measure of stability and well-being. The number of people in the penitentiary system can be another measure of those qualities: prosperous countries are usually expected to have less crime, so their prisons are not as full as those of developing countries. We will use the crim_pris_age Eurostat dataset, which provides more data than we need, so we select only the necessary rows. The criteria are the year 2018 and the total population (in terms of both sex and age). We can also choose between different units; we have decided on “Per hundred thousand inhabitants”, because it is a relative measure and therefore adequate for comparing countries. We expect Prisoners per 100K to correlate negatively with Average Salary.

crim <- get_eurostat("crim_pris_age")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/1f097bcf63e9fa8c8cfdda2532aeb3a2.rds
## Table  crim_pris_age  read from cache file:  /tmp/RtmpYIVRWJ/eurostat/1f097bcf63e9fa8c8cfdda2532aeb3a2.rds
crim <- crim[(crim$age == "TOTAL") & (crim$sex == "T") & (as.Date('2018-01-01') <= crim$TIME_PERIOD) & (crim$TIME_PERIOD < as.Date('2019-01-01')) & (crim$unit == "P_HTHAB"),]
crim[(crim$geo == "EL"), "geo"] <- "GR"
crim$geo <- countrycode(crim$geo, origin = 'iso2c', destination = 'country.name')
crim <- crim[c("geo", "values")] %>% rename(prisoners_per_100K=values)
data <- left_join(data, crim, by="geo")
data
## # A tibble: 26 × 5
##    geo      values distance_to_voronezh lyhin_preference prisoners_per_100K
##    <chr>     <dbl>                <dbl> <ord>                         <dbl>
##  1 Austria   44646                2980. High                          104. 
##  2 Belgium   46422                3865. Low                            90.0
##  3 Bulgaria   8127                2402. Low                            94.3
##  4 Cyprus    21472                1679. High                           71.6
##  5 Czechia   16137                3147. High                          203. 
##  6 Germany   43340                3263. Medium                         79.4
##  7 Denmark   59519                3400. High                           62.9
##  8 Estonia   17086                2940. Medium                        196. 
##  9 Greece    15920                2423. Medium                         99.2
## 10 Spain     27426                4652. Medium                        126. 
## # ℹ 16 more rows

We will conduct the same data analysis for Prisoners per 100K as we have done for Distance to Voronezh.

plot_eu(data, "prisoners_per_100K") +
    scale_fill_gradient(
        name = "Prisoners per 100K",
        low = "white", high = "red" 
    ) +
    guides(fill = guide_colorbar(barwidth = 1, barheight = 15))

hist(data$prisoners_per_100K, main="Histogram of Prisoners per 100K", col="red", xlab="Prisoners per 100K")

plot(
    data$prisoners_per_100K, data$values,
    main=sprintf("Prisoners per 100K X Average Salary (r=%.2f)", cor(data$prisoners_per_100K, data$values)),
    xlab="Prisoners per 100K",
    ylab="Average Salary",
    col="red", pch=19
)

sumtable(
  data, 'prisoners_per_100K',
  out='kable',
  summ=summ,
  factor.numeric = TRUE
)
Summary Statistics
Variable Mean Sd Skewness Min Pctile[25] Median Pctile[75] Max IQR
prisoners_per_100K 120 51 0.65 54 80 104 158 231 78

The number of prisoners is much higher in eastern countries. As a result, the distribution is right-skewed, with a mean of \(120\), a median of \(104\) and a standard deviation of \(51\) prisoners per hundred thousand inhabitants. This regressor correlates with Average Salary even more strongly than Distance to Voronezh (r = -0.55). We can also display the countries with the lowest and highest prisoner ratios:

data[c(which.min(data$prisoners_per_100K),which.max(data$prisoners_per_100K)), c('geo', 'values','prisoners_per_100K')]
## # A tibble: 2 × 3
##   geo       values prisoners_per_100K
##   <chr>      <dbl>              <dbl>
## 1 Finland    42321               53.6
## 2 Lithuania  13486              231.

Most Profitable NACE r2 A10 Category

We will use one more categorical feature: the most profitable NACE r2 A10 category. It should differentiate between differently structured economies, as different countries have different sources of income. For this we use the nama_10_a10 Eurostat dataset. The categories in this dataset are grouped section codes (e.g. B-E) rather than disjoint activities, and we treat each as a broader union of sectors. As the unit of measure we use percentage of GDP, to stay relative to each country’s scale.

nace <- get_eurostat("nama_10_a10", select_time="A")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/e7de891f77d50da354191e985f485361.rds
## Table  nama_10_a10  read from cache file:  /tmp/RtmpYIVRWJ/eurostat/e7de891f77d50da354191e985f485361.rds
nace2 <- nace[(nace$na_item == "B1G") & (as.Date('2018-01-01') <= nace$TIME_PERIOD) & (nace$TIME_PERIOD < as.Date('2019-01-01')) & (nace$unit == "PC_GDP"),]
nace2 <- nace2 %>% filter(nace_r2 != "TOTAL")
nace2 <- nace2 %>%
  mutate(nace_r2_sector = dplyr::recode(nace_r2,
    "A"   = "Agriculture, forestry and fishing",
    "B-E" = "Industry",
    "F"   = "Construction",
    "G-I" = "Wholesale, retail, transport, accommodation",
    "J"   = "Information and communication",
    "K"   = "Financial and insurance activities",
    "L"   = "Real estate activities",
    "M-N" = "Professional & admin services",
    "O-Q" = "Public administration, education, health",
    "R-U" = "Arts, entertainment, other services"
  ))
nace2 <- nace2 %>% group_by(geo) %>% top_n(1,values) %>% select(geo, nace_r2_sector)
nace2[(nace2$geo == "EL"), "geo"] <- "GR"
nace2$geo <- countrycode(nace2$geo, origin = 'iso2c', destination = 'country.name')
## Warning: Some values were not matched unambiguously: EA, EA12, EA19, EA20, EU27_2020, UK, XK
data <- left_join(data, nace2, by="geo")
nace_r2_lvls =sort(unique(data$nace_r2_sector))
data$nace_r2_sector <- factor(data$nace_r2_sector, levels=nace_r2_lvls, ordered = FALSE)
data
## # A tibble: 26 × 6
##    geo      values distance_to_voronezh lyhin_preference prisoners_per_100K
##    <chr>     <dbl>                <dbl> <ord>                         <dbl>
##  1 Austria   44646                2980. High                          104. 
##  2 Belgium   46422                3865. Low                            90.0
##  3 Bulgaria   8127                2402. Low                            94.3
##  4 Cyprus    21472                1679. High                           71.6
##  5 Czechia   16137                3147. High                          203. 
##  6 Germany   43340                3263. Medium                         79.4
##  7 Denmark   59519                3400. High                           62.9
##  8 Estonia   17086                2940. Medium                        196. 
##  9 Greece    15920                2423. Medium                         99.2
## 10 Spain     27426                4652. Medium                        126. 
## # ℹ 16 more rows
## # ℹ 1 more variable: nace_r2_sector <fct>

Let us now display those classes and the Average Salary in them.

color.fin = "#E77D88"
color.ind = "#BDE77D"
color.pub = "#7DE7DC"
color.ret = "#A77DE7"

plot_eu(data, "nace_r2_sector") +
  scale_fill_manual(
    name = "Most Profitable NACE2 A10 Categories",
    values = c(
      "Financial and insurance activities" = color.fin,
      "Industry" = color.ind,
      "Public administration, education, health" = color.pub,
      "Wholesale, retail, transport, accommodation" = color.ret
    )
  )

data.nace2.fin = data[data$nace_r2_sector == "Financial and insurance activities",]
data.nace2.ind = data[data$nace_r2_sector == "Industry",]
data.nace2.pub = data[data$nace_r2_sector == "Public administration, education, health",]
data.nace2.ret = data[data$nace_r2_sector == "Wholesale, retail, transport, accommodation",]

par(mar = c(5, 4, 4, 8), xpd = TRUE)
boxplot(
    values ~ nace_r2_sector, data=data,
    col=c(color.fin, color.ind, color.pub, color.ret),
    main="Average Salary by Most Profitable NACE2 A10 Categories",
    ylab="Average Salary",
    xaxt='n'
)

legend(
    "topright", inset=c(-0.2, 0),
    legend = c("Finance", "Industry", "Administration", "Wholesale"), 
    fill = c(color.fin, color.ind, color.pub, color.ret), 
    title = "NACE2 A10 Categories"
)

data %>%
  group_by(nace_r2_sector) %>%
  summarise(
    count = n(),
    mean = mean(values),
    sd = sd(values),
    min = min(values),
    median=median(values),
    max = max(values)
  )
## # A tibble: 4 × 7
##   nace_r2_sector                          count   mean     sd   min median   max
##   <fct>                                   <int>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1 Financial and insurance activities          1 65570     NA  65570 65570  65570
## 2 Industry                                    9 28461. 15884. 11337 24525  48211
## 3 Public administration, education, heal…     4 46632.  9330. 37583 44712. 59519
## 4 Wholesale, retail, transport, accommod…    12 18315.  6512.  8127 16503  29786

Most countries fall into 3 categories: Industry, Administration and Wholesale, with Luxembourg alone belonging to Finance. This collapses the Finance boxplot to a single line. The pooled mean of Administration is the highest, with Industry coming second and Wholesale last. The Industry sector also shows the largest spread, with a standard deviation of \(16k\) euros.

Daily Internet Use (%)

Our fifth feature is Daily Internet Use, coming from the Eurostat dataset isoc_ci_ifp_fu. We do not expect this variable to be related to Average Salary, because internet access is both cheap and widespread.

ifp <- get_eurostat("isoc_ci_ifp_fu")
## Dataset query already saved in cache_list.json...
## Reading cache file /tmp/RtmpYIVRWJ/eurostat/a1dfa1ed9f863c76d4f89b0c7b8f8dcd.rds
## Table  isoc_ci_ifp_fu  read from cache file:  /tmp/RtmpYIVRWJ/eurostat/a1dfa1ed9f863c76d4f89b0c7b8f8dcd.rds
ifp <- ifp[(as.Date('2018-01-01') <= ifp$TIME_PERIOD) & (ifp$TIME_PERIOD < as.Date('2019-01-01')) & (ifp$unit == "PC_IND") & (ifp$indic_is == "I_IDAY") & (ifp$ind_type == "IND_TOTAL"),]
ifp[(ifp$geo == "EL"), "geo"] <- "GR"
ifp$geo <- countrycode(ifp$geo, origin = 'iso2c', destination = 'country.name')
## Warning: Some values were not matched unambiguously: EA, EU15, EU27_2007, EU27_2020, EU28, UK, XK
ifp <- ifp[c("geo", "values")] %>% rename(daily_internet_usage_pc=values)
data <- left_join(data, ifp, by="geo")
data
## # A tibble: 26 × 7
##    geo      values distance_to_voronezh lyhin_preference prisoners_per_100K
##    <chr>     <dbl>                <dbl> <ord>                         <dbl>
##  1 Austria   44646                2980. High                          104. 
##  2 Belgium   46422                3865. Low                            90.0
##  3 Bulgaria   8127                2402. Low                            94.3
##  4 Cyprus    21472                1679. High                           71.6
##  5 Czechia   16137                3147. High                          203. 
##  6 Germany   43340                3263. Medium                         79.4
##  7 Denmark   59519                3400. High                           62.9
##  8 Estonia   17086                2940. Medium                        196. 
##  9 Greece    15920                2423. Medium                         99.2
## 10 Spain     27426                4652. Medium                        126. 
## # ℹ 16 more rows
## # ℹ 2 more variables: nace_r2_sector <fct>, daily_internet_usage_pc <dbl>
plot_eu(data, "daily_internet_usage_pc") +
    scale_fill_gradient(
        name = "Daily Internet Usage (%)",
        low = "cyan", high = "darkgreen" 
    ) +
    guides(fill = guide_colorbar(barwidth = 1, barheight = 15))

hist(
    data$daily_internet_usage_pc, 
    main="Histogram of Daily Internet Usage (%)", 
    col="darkgreen", 
    xlab="Daily Internet Usage (%)",
    breaks=(10 * (0:5)) + 50
)

plot(
    data$daily_internet_usage_pc, data$values,
    main=sprintf("Daily Internet Usage (%%) X Average Salary (r=%.2f)", cor(data$daily_internet_usage_pc, data$values)),
    xlab="Daily Internet Usage (%)",
    ylab="Average Salary",
    col="darkgreen", pch=19
)

sumtable(
  data, 'daily_internet_usage_pc',
  out='kable',
  summ=summ,
  factor.numeric = TRUE
)
Summary Statistics
Variable Mean Sd Skewness Min Pctile[25] Median Pctile[75] Max IQR
daily_internet_usage_pc 74 9.8 -0.1 53 68 74 81 92 13

Daily Internet Usage has a normal-looking distribution, with the heart of online activity in Denmark. Surprisingly, internet usage has a strong positive linear correlation with Average Salary, perhaps because the internet makes dealing with tasks more convenient, so workers can complete them faster.

Let us take a look at the data at hand and proceed to investigate dependencies between the regressors.

data
## # A tibble: 26 × 7
##    geo      values distance_to_voronezh lyhin_preference prisoners_per_100K
##    <chr>     <dbl>                <dbl> <ord>                         <dbl>
##  1 Austria   44646                2980. High                          104. 
##  2 Belgium   46422                3865. Low                            90.0
##  3 Bulgaria   8127                2402. Low                            94.3
##  4 Cyprus    21472                1679. High                           71.6
##  5 Czechia   16137                3147. High                          203. 
##  6 Germany   43340                3263. Medium                         79.4
##  7 Denmark   59519                3400. High                           62.9
##  8 Estonia   17086                2940. Medium                        196. 
##  9 Greece    15920                2423. Medium                         99.2
## 10 Spain     27426                4652. Medium                        126. 
## # ℹ 16 more rows
## # ℹ 2 more variables: nace_r2_sector <fct>, daily_internet_usage_pc <dbl>

Testing Regressors’ Relationships

We will first test relationships between the numerical regressors. Dependence between them can be described with correlation. Pearson correlation will be partially covered when we check for multicollinearity, so here we test for monotonic relationships between all pairs of numerical regressors with Spearman’s correlation test:

cor.test(data$distance_to_voronezh, data$prisoners_per_100K, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  data$distance_to_voronezh and data$prisoners_per_100K
## S = 3384, p-value = 0.4422
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1569231
cor.test(data$distance_to_voronezh, data$daily_internet_usage_pc, method = "spearman")
## Warning in cor.test.default(data$distance_to_voronezh,
## data$daily_internet_usage_pc, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  data$distance_to_voronezh and data$daily_internet_usage_pc
## S = 1697.8, p-value = 0.03287
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4195589
cor.test(data$prisoners_per_100K, data$daily_internet_usage_pc, method = "spearman")
## Warning in cor.test.default(data$prisoners_per_100K,
## data$daily_internet_usage_pc, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  data$prisoners_per_100K and data$daily_internet_usage_pc
## S = 3971.2, p-value = 0.07282
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## -0.357668

For distance_to_voronezh and daily_internet_usage_pc we reject the null hypothesis of zero correlation in favor of the alternative that a monotonic relationship exists. For the other two tests we fail to reject the hypothesis that there is no monotonic relation between the pairs; however, this tells us nothing about whether two of the regressors together can explain the third.

For our two categorical variables we can test something stronger than correlation: independence. Our options are the Chi-Square Test of Independence and Fisher’s Exact Test. Let us see whether the assumption of the Chi-Square Test holds:

data.short <- data
data.short$nace_r2_sector <- as.character(data.short$nace_r2_sector)

data.short$nace_r2_sector[data.short$nace_r2_sector == "Financial and insurance activities"] <- "Finance"
data.short$nace_r2_sector[data.short$nace_r2_sector == "Public administration, education, health"] <- "Administration"
data.short$nace_r2_sector[data.short$nace_r2_sector == "Wholesale, retail, transport, accommodation"] <- "Wholesale"

data.short$nace_r2_sector <- factor(data.short$nace_r2_sector)

ct.table <- table(data.short$lyhin_preference, data.short$nace_r2_sector)
ct.table
##         
##          Administration Finance Industry Wholesale
##   Low                 1       0        3         3
##   Medium              2       1        3         7
##   High                1       0        3         2
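The Chi-Square rule of thumb concerns the expected counts under independence, which can be inspected directly. The sketch below rebuilds the table from the printed counts so it runs on its own; in the report one would pass ct.table itself, and suppressWarnings hides the small-sample warning chisq.test emits.

```r
# Contingency table (preference x sector) from the counts above.
ct <- matrix(
  c(1, 2, 1,   # Administration
    0, 1, 0,   # Finance
    3, 3, 3,   # Industry
    3, 7, 2),  # Wholesale
  nrow = 3,
  dimnames = list(c("Low", "Medium", "High"),
                  c("Administration", "Finance", "Industry", "Wholesale"))
)

# Expected cell counts under independence; chisq.test() warns because
# the approximation is poor for counts this small.
expected <- suppressWarnings(chisq.test(ct))$expected
round(expected, 2)
```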

Almost all expected counts in the table are below 5, so the Chi-Square approximation is unreliable and it is better to use Fisher’s Exact Test.

fisher.test(ct.table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ct.table
## p-value = 0.9303
## alternative hypothesis: two.sided

The p-value is very high, so we fail to reject the null hypothesis of independence (i.e., there is no evidence of a statistically significant association between the two categorical variables).

And lastly we will compare the means of the numerical variables across the categories of nace_r2_sector (we will not test the numerical variables’ relationship with Lyhin preference, because the latter is far too exceptional to be tested here). We start by plotting boxplots.

par(mar = c(5, 4, 4, 8), xpd = TRUE)

boxplot(
    distance_to_voronezh ~ nace_r2_sector, data = data.short,
    main = "Distance to Voronezh by NACE2 A10 Categories",
    ylab = "Distance to Voronezh"
)
group_means <- tapply(data.short$distance_to_voronezh, data.short$nace_r2_sector, mean, na.rm = TRUE)
points(1:length(group_means), group_means, col = "coral", pch = 19)

boxplot(
    prisoners_per_100K ~ nace_r2_sector, data = data.short,
    main = "Prisoners per 100K by NACE2 A10 Categories",
    ylab = "Prisoners per 100K"
)
group_means <- tapply(data.short$prisoners_per_100K, data.short$nace_r2_sector, mean, na.rm = TRUE)
points(1:length(group_means), group_means, col = "coral", pch = 19)

boxplot(
    daily_internet_usage_pc ~ nace_r2_sector, data = data.short,
    main = "Daily internet usage pc by NACE2 A10 Categories",
    ylab = "Daily internet usage pc"
)
group_means <- tapply(data.short$daily_internet_usage_pc, data.short$nace_r2_sector, mean, na.rm = TRUE)
points(1:length(group_means), group_means, col = "coral", pch = 19)

The plots above show that the samples in the Administration and Finance categories have comparable means (orange dots), and the means are also similar for Industry and Wholesale. Now let us see what ANOVA tells us about the means. ANOVA tests the null hypothesis that the means are equal across categories against the alternative that at least one mean is different.
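Once its assumptions are verified (which we do next), the ANOVA itself is a short call. This sketch runs on simulated data so it is self-contained; in the report the same pattern would be applied to data.short, e.g. aov(prisoners_per_100K ~ nace_r2_sector, data = data.short).

```r
# Pattern of a one-way ANOVA call, illustrated on simulated data:
# three groups of 10 observations, the third with a shifted mean.
set.seed(42)
toy <- data.frame(
  y = rnorm(30, mean = rep(c(0, 0, 1), each = 10)),
  sector = factor(rep(c("A", "B", "C"), each = 10))
)
fit <- aov(y ~ sector, data = toy)
summary(fit)  # F statistic and Pr(>F) for the equality-of-means test
```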

But first we need to test the assumptions of normality of residuals and homogeneity of variances. One category (Finance) has only one observation, so we cannot estimate its variance. We will therefore exclude this category and compare only the means of the remaining three.

data.short <- data.short[data.short$nace_r2_sector != 'Finance',]
data.short
## # A tibble: 25 × 7
##    geo      values distance_to_voronezh lyhin_preference prisoners_per_100K
##    <chr>     <dbl>                <dbl> <ord>                         <dbl>
##  1 Austria   44646                2980. High                          104. 
##  2 Belgium   46422                3865. Low                            90.0
##  3 Bulgaria   8127                2402. Low                            94.3
##  4 Cyprus    21472                1679. High                           71.6
##  5 Czechia   16137                3147. High                          203. 
##  6 Germany   43340                3263. Medium                         79.4
##  7 Denmark   59519                3400. High                           62.9
##  8 Estonia   17086                2940. Medium                        196. 
##  9 Greece    15920                2423. Medium                         99.2
## 10 Spain     27426                4652. Medium                        126. 
## # ℹ 15 more rows
## # ℹ 2 more variables: nace_r2_sector <fct>, daily_internet_usage_pc <dbl>

For distance_to_voronezh ~ nace_r2_sector we cannot use ANOVA, because the residuals are not normal (the Shapiro-Wilk test rejects the null hypothesis):

fit.distance_to_voronezh <- lm(distance_to_voronezh ~ nace_r2_sector, data = data.short)
shapiro.test(fit.distance_to_voronezh$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit.distance_to_voronezh$resid
## W = 0.87192, p-value = 0.004728

So we should use the non-parametric Kruskal-Wallis test instead of ANOVA.
For the other variables we can use ANOVA, because after applying the normality test (Shapiro-Wilk) and the homogeneity-of-variance test (Bartlett) we fail to reject their null hypotheses:

fit.prisoners_per_100K <- lm(prisoners_per_100K ~ nace_r2_sector, data = data.short)
shapiro.test(fit.prisoners_per_100K$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit.prisoners_per_100K$resid
## W = 0.92336, p-value = 0.06114
bartlett.test(prisoners_per_100K ~ nace_r2_sector, data = data.short)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  prisoners_per_100K by nace_r2_sector
## Bartlett's K-squared = 2.6282, df = 2, p-value = 0.2687
fit.daily_internet_usage_pc <- lm(daily_internet_usage_pc ~ nace_r2_sector, data = data.short)
shapiro.test(fit.daily_internet_usage_pc$resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit.daily_internet_usage_pc$resid
## W = 0.98259, p-value = 0.9311
bartlett.test(daily_internet_usage_pc ~ nace_r2_sector, data = data.short)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  daily_internet_usage_pc by nace_r2_sector
## Bartlett's K-squared = 0.74888, df = 2, p-value = 0.6877
kruskal.test(distance_to_voronezh ~ nace_r2_sector, data = data.short)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  distance_to_voronezh by nace_r2_sector
## Kruskal-Wallis chi-squared = 4.0805, df = 2, p-value = 0.13
anova(aov(prisoners_per_100K ~ nace_r2_sector, data = data.short))
## Analysis of Variance Table
## 
## Response: prisoners_per_100K
##                Df Sum Sq Mean Sq F value Pr(>F)
## nace_r2_sector  2   9715  4857.4  1.9106 0.1718
## Residuals      22  55931  2542.3
anova(aov(daily_internet_usage_pc ~ nace_r2_sector, data = data.short))
## Analysis of Variance Table
## 
## Response: daily_internet_usage_pc
##                Df  Sum Sq Mean Sq F value Pr(>F)  
## nace_r2_sector  2  673.29  336.64  4.7124 0.0198 *
## Residuals      22 1571.63   71.44                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Only for daily internet usage, with p-value = 0.02 < \(\alpha = 0.05\), do we reject the null hypothesis and conclude that nace_r2_sector has a statistically significant effect on the mean of daily internet usage.
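ANOVA only says that at least one mean differs; a post-hoc procedure such as Tukey’s HSD can show which pairs of sectors drive the effect. A sketch, assuming data.short as constructed above:

```r
# Pairwise differences of group means with family-wise error control;
# pairs with adjusted p-value below 0.05 differ significantly
TukeyHSD(aov(daily_internet_usage_pc ~ nace_r2_sector, data = data.short))
```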

Modeling

We will build a model without interactions: we have a large number of regressors, and with interactions the number of coefficients would explode and become hard to interpret.

full_model <- lm(values ~ distance_to_voronezh + lyhin_preference + prisoners_per_100K + nace_r2_sector + daily_internet_usage_pc, data=data)
summary(full_model)
## 
## Call:
## lm(formula = values ~ distance_to_voronezh + lyhin_preference + 
##     prisoners_per_100K + nace_r2_sector + daily_internet_usage_pc, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8768.5 -4851.1   -62.4  3896.2 11281.7 
## 
## Coefficients:
##                                                             Estimate Std. Error
## (Intercept)                                                -3105.952  18647.576
## distance_to_voronezh                                           4.031      1.872
## lyhin_preference.L                                          2024.363   3162.588
## lyhin_preference.Q                                          1813.247   2399.589
## prisoners_per_100K                                           -98.668     30.690
## nace_r2_sectorIndustry                                    -25147.759   7395.114
## nace_r2_sectorPublic administration, education, health    -20932.406   7453.735
## nace_r2_sectorWholesale, retail, transport, accommodation -29737.164   7530.687
## daily_internet_usage_pc                                      762.698    187.598
##                                                           t value Pr(>|t|)    
## (Intercept)                                                -0.167 0.869681    
## distance_to_voronezh                                        2.153 0.045940 *  
## lyhin_preference.L                                          0.640 0.530643    
## lyhin_preference.Q                                          0.756 0.460203    
## prisoners_per_100K                                         -3.215 0.005082 ** 
## nace_r2_sectorIndustry                                     -3.401 0.003403 ** 
## nace_r2_sectorPublic administration, education, health     -2.808 0.012092 *  
## nace_r2_sectorWholesale, retail, transport, accommodation  -3.949 0.001036 ** 
## daily_internet_usage_pc                                     4.066 0.000804 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6455 on 17 degrees of freedom
## Multiple R-squared:  0.8939, Adjusted R-squared:  0.8439 
## F-statistic:  17.9 on 8 and 17 DF,  p-value: 7.434e-07
print(sprintf("RMSE of model: %.2f", sqrt(mean(full_model$residuals ^ 2))))
## [1] "RMSE of model: 5219.36"

Individual Regressors

Statistically Significant Predictors:

  • distance_to_voronezh: Each additional kilometer from Voronezh is associated with an increase of about 4 EUR in average annual wages per person.

  • prisoners_per_100K: An increase of one prisoner per 100 000 inhabitants corresponds to a decrease of roughly 99 EUR in per‑capita annual wages.

    Next come the coefficients for the nace_r2_sector feature; the Financial and insurance activities category was chosen as the baseline for comparison (it comes first lexicographically):

  • nace_r2_sectorIndustry: Regions where the predominant sector is Industry tend to have average annual wages lower by about 25 148 EUR compared to the reference sector.

  • nace_r2_sectorPublic administration, education, health: The Public administration, education, and health sector is associated with per‑capita wages around 20 932 EUR below the baseline.

  • nace_r2_sectorWholesale, retail, transport, accommodation: This sector shows the largest negative effect – approximately 29 737 EUR lower average wages than the baseline.

    And the last significant regressor:

  • daily_internet_usage_pc: A one‑percentage‑point increase in daily internet usage is linked to about 763 EUR higher annual wages per person.

NOT Statistically Significant:

  • Intercept: estimated at about -3106 EUR, but it has a large standard error and is insignificant.

For the ordinal feature *lyhin_preference* R gave an interesting output with .L and .Q suffixes. This is because internally R uses orthogonal polynomial contrasts for ordinal features to capture additional valuable information. So we got:

  • lyhin_preference.L: an insignificant linear trend; as we go from Low to Medium to High, the outcome tends to increase by about 2024 EUR per step.

  • lyhin_preference.Q: the estimate is 1813, meaning the middle level (the Medium category) is higher than a purely linear trend would predict, but there is no strong evidence that the relationship between this predictor and the regressand is U-shaped.
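The .L and .Q columns come from contr.poly(), which encodes a k-level ordered factor as k-1 orthogonal polynomial contrasts; a quick look at the three-level case:

```r
# Contrasts R substitutes for a 3-level ordered factor:
# .L is proportional to (-1, 0, 1) -- the linear trend Low -> High
# .Q is proportional to (1, -2, 1) -- curvature: Medium vs the ends
contr.poly(3)
```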

Model Performance

  • Adjusted R-squared: 0.8439 - The model explains about 84.4% of the variability in average annual wages per-capita, after accounting for the number of predictors.

  • RMSE: 5 219.36 EUR - On average, predicted wages deviate from observed values by about 5 219 EUR, reflecting a reasonably tight fit given the scale of wages.
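A note on the two error measures: summary()’s residual standard error divides the residual sum of squares by the residual degrees of freedom, while our RMSE divides it by n, so RMSE is always the smaller of the two. With the full model’s RSS the numbers reconcile:

```r
# Reconciling RMSE with the residual standard error
rss <- 708284975          # full model residual sum of squares
n <- 26; df_resid <- 17   # 26 countries; 17 residual degrees of freedom
sqrt(rss / df_resid)      # ~ 6455, the residual standard error in summary()
sqrt(rss / n)             # ~ 5219, the RMSE printed above
```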

This regression model indicates that macro‑economic and demographic factors are strong determinants of average annual wages per person (in EUR). Distance from Voronezh also plays a modest but significant role, while the preference of our dear Mr. Lyhin shows no meaningful impact on wages.

Outliers

plot(full_model, which=4)

plot(full_model, which=5)

largest_cookd <-  data[c(4, 5, 22),]
largest_cookd$modeled_values <-  predict(full_model, newdata=largest_cookd)
largest_cookd <- largest_cookd[c("geo", "values", "modeled_values")]
largest_cookd
## # A tibble: 3 × 3
##   geo      values modeled_values
##   <chr>     <dbl>          <dbl>
## 1 Cyprus    21472         27597.
## 2 Czechia   16137         24082.
## 3 Portugal  17601         26370.

The Cook’s distance diagnostic flags Portugal (22) as one of the most influential observations in the model.
On the Residuals vs. Leverage plot, Portugal (22) sits in the lower‑right corner, i.e. it combines relatively high leverage with a large negative residual.
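The row indices used above come from cooks.distance(); a sketch of how the three most influential rows can be located (assuming full_model and data as above):

```r
# Cook's distance for every observation; the three largest are the
# rows extracted above (indices 4, 5 and 22: Cyprus, Czechia, Portugal)
cd <- cooks.distance(full_model)
head(sort(cd, decreasing = TRUE), 3)
```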

Multicollinearity

An overview of how much each regressor depends on the others is provided by the variance inflation factor, \(\mathrm{VIF}_j = 1/(1 - R_j^2)\), where \(R_j^2\) measures how well the j-th regressor is explained by the other predictors. In our case, however, we will use the adjusted generalized VIF, because we have a factor variable.

vif(full_model, type = "predictor")
## GVIFs computed for predictors
##                             GVIF Df GVIF^(1/(2*Df)) Interacts With
## distance_to_voronezh    1.264712  1        1.124594           --  
## lyhin_preference        2.016928  2        1.191716           --  
## prisoners_per_100K      1.486679  1        1.219294           --  
## nace_r2_sector          2.444934  3        1.160676           --  
## daily_internet_usage_pc 2.040475  1        1.428452           --  
##                                                                                            Other Predictors
## distance_to_voronezh          lyhin_preference, prisoners_per_100K, nace_r2_sector, daily_internet_usage_pc
## lyhin_preference          distance_to_voronezh, prisoners_per_100K, nace_r2_sector, daily_internet_usage_pc
## prisoners_per_100K          distance_to_voronezh, lyhin_preference, nace_r2_sector, daily_internet_usage_pc
## nace_r2_sector          distance_to_voronezh, lyhin_preference, prisoners_per_100K, daily_internet_usage_pc
## daily_internet_usage_pc          distance_to_voronezh, lyhin_preference, prisoners_per_100K, nace_r2_sector

We can be confident that the coefficient estimates and standard errors aren’t being inflated by collinearity among these variables (all adjusted GVIF values are below 2, which the literature considers a good result).

Homoscedasticity

Let’s see how residuals change with fitted values to see if there are changes in variance.

plot(full_model, which=3)
## Warning: not plotting observations with leverage one:
##   18

largest_scale <-  data[c(5, 22, 1),]
largest_scale$modeled_values <-  predict(full_model, newdata=largest_scale)
largest_scale <- largest_scale[c("geo", "values", "modeled_values")]
largest_scale
## # A tibble: 3 × 3
##   geo      values modeled_values
##   <chr>     <dbl>          <dbl>
## 1 Czechia   16137         24082.
## 2 Portugal  17601         26370.
## 3 Austria   44646         33364.

The red line shows some fluctuations, which could suggest heteroscedasticity. Since it is not obvious what to conclude, we will use the Breusch-Pagan test.

bptest(full_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  full_model
## BP = 14.347, df = 8, p-value = 0.07316

At the 5% significance level, we fail to reject the null hypothesis of homoscedasticity (constant variance), since \(p=0.073>0.05\). In other words, there’s no strong statistical evidence of heteroscedasticity in our model’s residuals.
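For intuition about what bptest() reports: the studentized Breusch-Pagan statistic equals n times the \(R^2\) of an auxiliary regression of the squared residuals on the model’s regressors. A self-contained check on a built-in dataset (cars, not our data):

```r
library(lmtest)  # already loaded in the Libraries section

# Studentized BP statistic = n * R^2 of squared residuals on regressors
fit <- lm(dist ~ speed, data = cars)
aux <- lm(resid(fit)^2 ~ speed, data = cars)
c(manual = nrow(cars) * summary(aux)$r.squared,
  bptest = unname(bptest(fit)$statistic))
```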

Normality of Residuals

We will use two plots: Residuals vs Fitted and Q-Q Plot.

plot(full_model, which=1)

plot(full_model, which=2)

Plots:

  1. Residuals vs. Fitted: No obvious skew or systematic curvature in the residual–fitted plot.

  2. Normal Q–Q Plot: Most points hug the 45° reference line, with only minor deviations at the extremes – consistent with approximate normality.

This suggests that the residuals are approximately normal, but we can also confirm it with a formal test.

shapiro.test(full_model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  full_model$residuals
## W = 0.97171, p-value = 0.6681

Shapiro–Wilk Test - Since \(p=0.6681>0.05\), we fail to reject the null hypothesis that the residuals come from a normal distribution.

The model’s residuals appear to be reasonably normally distributed, satisfying the normality assumption for linear regression.

Choosing submodel

The only issue found while checking the conditions for linear regression was the presence of outliers. But our task is to estimate the salary for each European country, so it would be unwise to remove some countries. Besides, the influence of the outliers is not very severe.

We can try to reduce this model to a submodel using the function step(), which greedily steps through the variables: at each step, if adding or removing a variable reduces the AIC, it applies that change, and it continues until no move improves the AIC:
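For linear models step() relies on extractAIC(), which (up to an additive constant) computes \(\mathrm{AIC} = n\,\log(\mathrm{RSS}/n) + 2k\), where k is the number of estimated coefficients. A self-contained check on a built-in dataset:

```r
# extractAIC() for lm: n * log(RSS / n) + 2 * k, up to a constant
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
rss <- sum(resid(fit)^2)
k   <- length(coef(fit))   # intercept + slope
c(manual = n * log(rss / n) + 2 * k, extractAIC = extractAIC(fit)[2])
```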

step(lm(values ~ 1, data = data), 
     scope = list(lower = ~ 1, upper = ~ distance_to_voronezh + lyhin_preference + prisoners_per_100K + nace_r2_sector + daily_internet_usage_pc), 
     direction = "both", 
     data = data)
## Start:  AIC=505.45
## values ~ 1
## 
##                           Df  Sum of Sq        RSS    AIC
## + daily_internet_usage_pc  1 4070215215 2603575315 482.97
## + nace_r2_sector           3 3927632971 2746157559 488.36
## + prisoners_per_100K       1 2035394659 4638395871 497.99
## + distance_to_voronezh     1 1263621876 5410168654 501.99
## + lyhin_preference         2 1192947896 5480842634 504.33
## <none>                                  6673790530 505.45
## 
## Step:  AIC=482.97
## values ~ daily_internet_usage_pc
## 
##                           Df  Sum of Sq        RSS    AIC
## + prisoners_per_100K       1  801779915 1801795400 475.40
## + nace_r2_sector           3 1037080762 1566494553 475.76
## + distance_to_voronezh     1  440579320 2162995995 480.15
## <none>                                  2603575315 482.97
## + lyhin_preference         2  110372905 2493202410 485.85
## - daily_internet_usage_pc  1 4070215215 6673790530 505.45
## 
## Step:  AIC=475.4
## values ~ daily_internet_usage_pc + prisoners_per_100K
## 
##                           Df  Sum of Sq        RSS    AIC
## + nace_r2_sector           3  784633670 1017161730 466.54
## + distance_to_voronezh     1  381420062 1420375338 471.22
## <none>                                  1801795400 475.40
## + lyhin_preference         2   48086233 1753709168 478.70
## - prisoners_per_100K       1  801779915 2603575315 482.97
## - daily_internet_usage_pc  1 2836600471 4638395871 497.99
## 
## Step:  AIC=466.54
## values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector
## 
##                           Df  Sum of Sq        RSS    AIC
## + distance_to_voronezh     1  258713347  758448383 460.91
## <none>                                  1017161730 466.54
## + lyhin_preference         2  115688334  901473396 467.40
## - nace_r2_sector           3  784633670 1801795400 475.40
## - prisoners_per_100K       1  549332823 1566494553 475.76
## - daily_internet_usage_pc  1 1026278794 2043440524 482.67
## 
## Step:  AIC=460.91
## values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector + 
##     distance_to_voronezh
## 
##                           Df Sum of Sq        RSS    AIC
## <none>                                  758448383 460.91
## + lyhin_preference         2  50163408  708284975 463.13
## - distance_to_voronezh     1 258713347 1017161730 466.54
## - nace_r2_sector           3 661926956 1420375338 471.22
## - prisoners_per_100K       1 538600117 1297048499 472.86
## - daily_internet_usage_pc  1 927470921 1685919303 479.67
## 
## Call:
## lm(formula = values ~ daily_internet_usage_pc + prisoners_per_100K + 
##     nace_r2_sector + distance_to_voronezh, data = data)
## 
## Coefficients:
##                                               (Intercept)  
##                                                 -6841.814  
##                                   daily_internet_usage_pc  
##                                                   774.653  
##                                        prisoners_per_100K  
##                                                   -98.562  
##                                    nace_r2_sectorIndustry  
##                                                -23304.347  
##    nace_r2_sectorPublic administration, education, health  
##                                                -19760.077  
## nace_r2_sectorWholesale, retail, transport, accommodation  
##                                                -28517.991  
##                                      distance_to_voronezh  
##                                                     4.355

Key AIC Improvements:

  • Adding daily_internet_usage_pc dropped AIC from 505.45 to 482.97, the largest single improvement.

  • Then prisoners_per_100K further cut AIC to 475.40.

  • Including nace_r2_sector (all three dummy contrasts) brought AIC down to 466.54.

  • Finally distance_to_voronezh produced the lowest AIC of 460.91.

So we can interpret those improvements as:

  • Internet Usage and our very original Distance to Voronezh remain strong positive predictors.

  • Prisoner Rate retains its negative association.

  • Sectoral Dummies account for large inter‐sector wage differences and are also informative for the model.

  • lyhin_preference was never retained, indicating it adds little explanatory power beyond these four core factors.

Exclusion of lyhin_preference

In each step of the AIC‐driven selection process, adding the linear and quadratic contrasts for lyhin_preference failed to reduce the AIC beyond what was achieved with the core predictors. Moreover, in the full model, both the linear and quadratic terms of lyhin_preference had high p‑values, providing no evidence that preference of our good old Mr. Lyhin explains additional variation in average wages.

Therefore, dropping lyhin_preference yields a simpler model without sacrificing explanatory power.

submodel <- lm(formula = values ~ daily_internet_usage_pc + prisoners_per_100K +  nace_r2_sector + distance_to_voronezh, data = data)
summary(submodel)
## 
## Call:
## lm(formula = values ~ daily_internet_usage_pc + prisoners_per_100K + 
##     nace_r2_sector + distance_to_voronezh, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7494.0 -4168.6  -698.8  3028.9 13464.1 
## 
## Coefficients:
##                                                             Estimate Std. Error
## (Intercept)                                                -6841.814  16544.850
## daily_internet_usage_pc                                      774.653    160.710
## prisoners_per_100K                                           -98.562     26.833
## nace_r2_sectorIndustry                                    -23304.347   7044.589
## nace_r2_sectorPublic administration, education, health    -19760.077   7121.155
## nace_r2_sectorWholesale, retail, transport, accommodation -28517.991   7184.646
## distance_to_voronezh                                           4.355      1.711
##                                                           t value Pr(>|t|)    
## (Intercept)                                                -0.414 0.683848    
## daily_internet_usage_pc                                     4.820 0.000119 ***
## prisoners_per_100K                                         -3.673 0.001615 ** 
## nace_r2_sectorIndustry                                     -3.308 0.003697 ** 
## nace_r2_sectorPublic administration, education, health     -2.775 0.012066 *  
## nace_r2_sectorWholesale, retail, transport, accommodation  -3.969 0.000822 ***
## distance_to_voronezh                                        2.546 0.019734 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6318 on 19 degrees of freedom
## Multiple R-squared:  0.8864, Adjusted R-squared:  0.8505 
## F-statistic:  24.7 on 6 and 19 DF,  p-value: 5.181e-08
print(sprintf("RMSE of model: %.2f", sqrt(mean(submodel$residuals ^ 2))))
## [1] "RMSE of model: 5401.03"

Submodel Model Performance

  • Adjusted R-squared: 0.8505 - Explains 85.05% of wage variability – slightly above the full model’s 84.39%, despite dropping predictors.

  • RMSE: 5 401.03 EUR - A modest increase from the full model’s 5 219.36 EUR, reflecting a small trade‑off in average error for greater simplicity.

The submodel maintains a strong fit. Its balanced performance makes it the preferred choice for predicting annual wages.

anova(submodel, full_model)
## Analysis of Variance Table
## 
## Model 1: values ~ daily_internet_usage_pc + prisoners_per_100K + nace_r2_sector + 
##     distance_to_voronezh
## Model 2: values ~ distance_to_voronezh + lyhin_preference + prisoners_per_100K + 
##     nace_r2_sector + daily_internet_usage_pc
##   Res.Df       RSS Df Sum of Sq     F Pr(>F)
## 1     19 758448383                          
## 2     17 708284975  2  50163408 0.602  0.559

The partial F‐test confirms that adding the two lyhin_preference contrasts doesn’t significantly improve fit. Since \(p>0.05\), we fail to reject the null hypothesis that the smaller model (without lyhin_preference) is as good as the full model. In other words, those two terms do not explain a statistically significant amount of additional variance, reinforcing their exclusion for simplicity.
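The F statistic above can be reproduced by hand: \(F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/q}{\mathrm{RSS}_2/(n - p)}\), with \(q = 2\) dropped lyhin_preference contrasts and 17 residual degrees of freedom in the full model:

```r
# Partial F-test by hand, using the RSS values from the table above
rss_sub  <- 758448383   # submodel (Model 1)
rss_full <- 708284975   # full model (Model 2)
((rss_sub - rss_full) / 2) / (rss_full / 17)   # ~ 0.602, matching anova()
```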

Conclusion

Our selection of regressors resulted in a linear model that effectively describes average salary in European countries. It was surprising to discover that the chosen regressors were all unique and strong in their own way. There was little correlation between them, which is one of the reasons why almost all the assumptions for using a linear model were satisfied.